Breadlines

Regions of Deprivation

Introduction

In addition to obvious networks, we can also use graph theory to think though some non-obvious application. One place this often crops up in planning is to think about adjacency/proximity as a graph. See my post on employment centers and paper on delineating areas.

In this post, I will demonstrate how to use networks for other spatial analysis. We can think of polygons as nodes and two nodes are connected if they share some part of the boundary

Acquire data

I also want to demonstrate Kyle Walker’s excellent tigris and tidycensus packages that allow us to conveniently download Census data.

First you need to acquire an API key from census.

We will use the API key to access American Community Survey (ACS) data.The American Community Survey (ACS) is an ongoing survey by the U.S. Census Bureau. It regularly gathers information previously contained only in the long form of the decennial census, such as ancestry, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics. About 3.5 million people are surveyed every year and multi-year summaries are provided at different levels of geography (not at block level).

Because it is a survey, each variable comes with a margin of error. These are widely ignored in practise and there is a tendency to take the estimates at face value. This practise is problematic.

library(tidycensus)
library(sf)
library(tidyverse)
library(mapview)
library(RColorBrewer)
library(igraph)
library(tigris)

census_api_key(census_apikey)
varnames <- load_variables(2016, "acs5", cache = TRUE)
write_csv(varnames, path="/Users/kaza/Desktop/acs_varnames.csv")

It is useful to write the variables and their descriptions into an external file that you can use in other programs and this is usually handy for finding variables of interest. In this post, I am going to use poverty rate and unemployment rate as examples. The choice of these variables is arbitrary.

Poverty rate

For example, to find the poverty rate in Carolinas’ census tracts, you can use B17001_01 (persons for whom poverty level is tracked) and B17001_002 (persons whose income in the last 12 months was below poverty level).

NC_SC <- c('37', '45') # FIPS code for NC and SC.

# Download CBSAs and MSAs for reference.

cbsa <- core_based_statistical_areas() %>% st_as_sf()  # Download Core Based Statistical Areas file
states_shps <- states() %>% 
               st_as_sf() %>%
               filter(GEOID %in% NC_SC) # Limit to the Carolinas
 
MSA <- cbsa[states_shps,] %>%  # Filter CBSA that is intersected/touched by the state boundary.
            filter(LSAD=="M1")  # Limit to Metropolitan areas
pov <- reduce(
  map(NC_SC, function(x) {
    get_acs(geography = "tract", variables = "B17001_002",  summary_var = 'B17001_001',
            state = x, geometry = TRUE, year = 2016)
  }), 
  rbind
) ## read up on these map and reduce functions in purrr. They are key to functional programming paradigm
nrow(pov)
# [1] 3298

pov_rate <- pov %>%
  rename(population = summary_est) %>%
  filter(population>0)%>%
  mutate(pov_rate = estimate/population) %>%
  select(GEOID, NAME, population, pov_rate)

library(mapview)
mapview(pov_rate, 
        zcol=c('pov_rate'),
        col.regions = brewer.pal(5, "Reds"),
        at=quantile(pov_rate$pov_rate, seq(0,1,.2)),
        lwd=0
        )

Unemployment rate

Unemployment rate is trickier, because it needs to track both the labour force as well as unemployed persons. It turns out that ACS only provides this data for males and females separtely and by differnt age categories. So we need to assemble them and combine them.

# variables for labour force for age groups 16-64
lf_m <- paste("B23001_", formatC(seq(4,67,7), width=3, flag="0"), "E", sep="")
lf_f <- paste("B23001_", formatC(seq(90,153,7), width=3, flag="0"), "E", sep="")

lf <- reduce(
  map(NC_SC, function(x){
    get_acs(geography='tract', variables = c(lf_m, lf_f), state=x, year=2016)
  }),
  rbind
)
# variables for unemployed for age groups 16-64
unemp_m <- paste("B23001_", formatC(seq(8,71,7), width=3, flag="0"), "E", sep="")
unemp_f <- paste("B23001_", formatC(seq(94,157,7), width=3, flag="0"), "E", sep="")

unemp <- reduce(
  map(NC_SC, function(x){
    get_acs(geography='tract', variables = c(unemp_m, unemp_f), state=x, year=2016)
  }),
  rbind
)
lf_t <- lf %>% 
  group_by(GEOID) %>%
  summarize(lf_est = sum(estimate, na.rm=T))

unemp_t <- unemp %>% 
  group_by(GEOID) %>%
  summarize(unemp_est = sum(estimate, na.rm=T))

unemp_rate <- merge(lf_t, unemp_t, by='GEOID') %>% 
  filter(lf_est >0) %>%
  mutate(unemp_rate = unemp_est/lf_est)

Exercise

  • Compute unemployment rate for more meaningful classes (Women, Youth etc.) and visualise it. Are there differences in spatial distribution?

We can arbitarily define distress to be a combination of poverty and unemployment. In this instance, I am picking 20% and 10% as the threshold respectively.

tract_stats <- merge(pov_rate, unemp_rate, by='GEOID')
distressed_tracts <- tract_stats %>%
                     filter(unemp_rate > .10 & pov_rate > .20)

mapview(distressed_tracts)

Space as a topological network

There are reasons to presume that clusters of distress is more problematic than isolated distress. We could use the contiguity of distressed tracts to our advantage to figure where these clusters are and how they are shaped.

It turns out that the contiguity calculations in sf are still a little slower than older packages such as spdep. So I will convert the simple features into a Spatial object and use older packages.

library(rgeos)
library(spdep)

dt <- as(distressed_tracts, "Spatial") # Convert to spatial
temp1 <- gUnarySTRtreeQuery(dt) # Construct list of polygons that are likely to intersect/touch by first looking at bounding boxes.
dt_nb <- poly2nb(dt, queen=FALSE, foundInBox=temp1, row.names = dt$GEOID) # Construct a neighborhood object

plot(dt)
plot(dt_nb, coordinates(dt), col='red', points=F, add=T) # Quickly visualise the graph


distressed_tracts_graph <- dt_nb %>%  
  nb2mat(zero.policy=TRUE, style="B") %>% # Create a Binary Adjacency Matrix.
  graph.adjacency(mode='undirected', add.rownames=NULL) # construct an undirected graph from the adjacency matrix.

cl <- clusters(distressed_tracts_graph) # This decomposes the graph into connected subgraphs.

sum(dt$GEOID!= names(cl$membership)) # this is 0. So we can safely assign the clustermembersip variable by rows rather than doing a merge.
# [1] 0
dt$clustermembership <- cl$membership


### Visualisation

# Adding the MSA boundaries for effect to see if the deprivation is primiarly a metropolitan phenomenon or outside the metropolitan areas

pal <- colorRampPalette(brewer.pal(8, "Dark2"))
numColors <- cl$membership %>% unique() %>% length()

dt %>% st_as_sf() %>%
  mapview(zcol="clustermembership", 
          col.regions=pal(numColors),
          legend = FALSE,
          lwd = 0
          ) +
  
 mapview(MSA, col.regions = "gray95", color='Red', lwd=2, alpha=.4)

Clusters function decomposes the graph in subgraphs, i.e. nodes belong to a subgraph, if there is a path between them. If there is not, they belong to different subgraphs.

ggplot() + 
  geom_histogram(aes(cl$csize))

From the above histogram it is relatively obvious that the many (64.67 %) of the distressed census tracts are are relatively isolated (cluster size is \(\le\) 2). Many of them are even in clusters of size 1. We could see if there is a geograhic pattern in these isolated clusters.

smallclusters <- (1:cl$no)[cl$csize <=2]
mapview(dt[dt$clustermembership %in% smallclusters,]) +  # Filter all tracts that are within 'small' clusters of deprivation.
  mapview(MSA, col.regions = "gray95", color='Red', lwd=2, alpha=.4)

Now that different tracts are assigned to different clusters, we can summarise the poverty rate in different clusters. Within in each component of the graph, we can also identify a community structure if you wish, using techniques from an earlier post. I leave these as exercises.

In some instances, it may be useful to see if the clusters are stringy’ or if they are ‘blobs’. i.e., if the clusters follow linear features (such as roads, valleys, rivers etc) or if they adjacency is shared among multiple polygons of the same component. To do this, we could rely on diameter of the graph, which is the longest path between any two nodes in graph. If the graph is stringy, it will have a large diameter.

In this instance, we want to compute the diameters of each component separately. (diameter of the graph is \(\infty\), as it is not fully connected. )


diameters_of_graphs <-   distressed_tracts_graph %>%
       decompose() %>%
       sapply(function(x)diameter(x)) %>% 
       as.tibble()

names(diameters_of_graphs)[1] <- 'dia'

diameters_of_graphs$cl_membership <- 1:(distressed_tracts_graph %>% decompose() %>% length())

dt <- merge(dt, diameters_of_graphs, by.x='clustermembership', by.y='cl_membership')

# Quickly showing the stringy and blobby (real words) distressed regions
mapview(dt[dt$dia > 10,'GEOID'], col.regions='red') +
mapview(dt[dt$dia < 4,'GEOID'], col.regions = 'blue')

Most of large spatial concentrations of distress is primarily in the rural ares (midlands of South Carolina and Eastern Carolina Plains). However, this may be misleading. There are concentrations of distress within the urban regions.

Since census tracts are very small in urban centers within metropolitan areas, they are more likely to be clustered and separated. So the unequal sizes of geography affects the conclusions you can draw from the map. Perhaps use a different visualisation and further exploratory analysis to identify the truly disadvantaged.


Exercise

  • Limiting your analysis to Metropolitan Areas in Carolinas, what conclusions can you draw about the shape of the clusters?
  • Does limiting your analysis to a single state, affect your results?
  • Does changing the unit of analysis matter? Say e.g. instead of the tract, we use counties for the poverty and unemployment characteristics, how are the conclusions different?

Further analysis

Instead of categorising the areas of distress by dichotomising, one could use local Moran's I to identify locations high values of a continous variable (say poverty rate) surrounded by other high values areas (or low value areas) and test their statistical signifance. local.moran is a function in spdep and relies on the neighbourhood graph. I leave this as an exercise.

Conclusions

Borrowing some concepts from network analysis, helps us in many analytical tasks with regards to space. Todd BenDor and I argued that in some instances it may be beneficial to think about space as networks for modelling purposes too. Tools from graph theory have been incredibly useful to me in a number of different projects.

Related

comments powered by Disqus